Exploring the components of the coal consumption time series is required before moving on. It is crucial to understand how each component affects the whole series so that later models can take these components into account. Time series data primarily consist of four elements: level, trend, seasonality, and noise. Level, trend, and seasonality are systematic components, whereas noise is the nonsystematic component. A greater comprehension of a time series may be attained by breaking it down into these parts.
In this exploratory data analysis, the plots examine the time series of quarterly U.S. total coal consumption only. Even though the collected data covers all 50 states (and that information will be used in future analysis), it makes sense to scope this analysis to the U.S. total for simplicity, especially since all the states follow the same overall trend in coal consumption.
Time Series Plot
Firstly, it is important to visualize the time series at its most basic level, which is a simple plot of data over time.
Code
coal_df <- read.csv("/Users/raezh1/Documents/Georgetown/ANLY560/website/Time Series/data/coal_us_consumption.csv") |>
  group_by(period) |>
  summarise(Consumption = sum(consumption)) |>
  filter(!period %in% c('2000-Q1', '2000-Q2', '2000-Q3', '2000-Q4'))
coal_df_ts <- ts(coal_df$Consumption, start = c(2001, 1), end = c(2021, 1), frequency = 4)
A multiplicative trend means the trend is not linear (a curved line), and multiplicative seasonality means the widths or heights of the seasonal periods change over time. Based on the time series graph created above, we should use the multiplicative model in decompose() to decompose the coal consumption series.
Code
decompose_coal <- decompose(coal_df_ts, "multiplicative")
autoplotly(decompose_coal) + ggtitle("Decomposition of U.S. Total Coal Consumption")
In R, the stl() function decomposes a time series into seasonal, trend, and irregular components using loess. stl() will handle any type of seasonality, not only monthly and quarterly data.
The two decomposition plots are very similar, except that the remainder from the stl() function is less smoothed and noisier than the remainder from decompose(). Both show that the U.S. quarterly coal consumption data has seasonality. Before 2010 the trend of the data is increasing; after 2010 it is decreasing.
Since the data has a trend, it is not stationary: a stationary time series is one whose properties do not depend on the time at which the series is observed. The ACF graph below will show more evidence of this non-stationarity.
Lag Plots
A lag plot is a type of scatter plot in which a time series is plotted against a lagged copy of itself, some number of time units behind or ahead. Lag plots often reveal more about the seasonality of the data, whether there is randomness in the data, or whether autocorrelation is present. Below is a lag plot for the quarterly U.S. coal consumption data.
Code
library(wesanderson)
gglagplot(coal_df_ts, do.lines = FALSE,
          set.lags = c(4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64)) +
  xlab("Lags") +
  ggtitle("Lag Plot of U.S. Total Coal Consumption") +
  theme_minimal() +
  ylab("Coal Consumption in short tons") +
  theme_light() +
  theme(text = element_text(size = 12, family = "Palatino")) +
  labs(fill = "Legend") +
  scale_color_brewer(palette = "Set2") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
From the lag plots we can see that the coal consumption data has serial correlation: the lag plots show a linear pattern, which suggests autocorrelation is present. The pattern is a positive linear trend, so the autocorrelation is positive. Seasonality, however, is hard to spot from the lag plots alone; the ACF plot below is better suited to revealing it.
Autocorrelation and Stationarity
The function ACF computes (and by default plots) an estimate of the autocorrelation function of a univariate time series. Function PACF computes (and by default plots) an estimate of the partial autocorrelation function of a univariate time series.
Autocorrelation and partial autocorrelation plots are heavily used in time series analysis and forecasting.
Code
acf <- ggAcf(coal_df_ts, 80) + ggtitle("ACF Plot of U.S. Total Coal Consumption")
ggplotly(acf)
Note that the ACF shows an oscillation, indicative of a seasonal series. The peaks occur at lags 4, 8, 12, and so on, i.e., every four quarters, which means the data has yearly seasonality.
A stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary — it does not matter when you observe it, it should look much the same at any point in time.
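As a point of comparison, a simulated white noise series (a hypothetical example, not part of the coal data) illustrates what a stationary ACF looks like; ggAcf() from the forecast package is used here as elsewhere in this analysis.

```r
# Simulated white noise: stationary by construction, so its ACF should show
# no significant spikes beyond the sampling bounds.
library(forecast)
set.seed(42)
wn <- ts(rnorm(81), start = c(2001, 1), frequency = 4)  # same length/frequency as the coal series
ggAcf(wn) + ggtitle("ACF of Simulated White Noise")
```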
As well as looking at the time plot of the data, the ACF plot is also useful for identifying non-stationary time series. For a stationary time series, the ACF will drop to zero relatively quickly, while the ACF of non-stationary data decreases slowly. Our ACF graph decreases slowly and we can clearly see the data is not stationary.
Code
pacf <- ggPacf(coal_df_ts, 50) + ggtitle("PACF Plot of U.S. Total Coal Consumption")
ggplotly(pacf)
From the PACF plot, we can see a large spike at lag 1 that decreases after a few lags, which indicates a moving average term in the data.
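To see why a gradually decaying PACF points toward a moving average component, one can simulate an MA(1) process and plot its PACF. This is an illustrative sketch using base R's arima.sim() with an assumed coefficient of 0.7, not part of the coal analysis.

```r
# A simulated MA(1) series: its ACF cuts off after lag 1, while its PACF
# decays gradually over several lags -- the same pattern described above.
library(forecast)
set.seed(1)
ma1 <- arima.sim(model = list(ma = 0.7), n = 200)
ggPacf(ma1) + ggtitle("PACF of a Simulated MA(1) Series")
```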
Differencing and Detrending
Differences and Seasonal Differences Estimation with Stationarity and Seasonality Tests
The Augmented Dickey-Fuller test is a type of statistical test called a unit root test.
The ADF test is conducted with the following hypotheses.
Null Hypothesis (H0): the series is non-stationary, i.e., has a unit root.
Alternate Hypothesis (HA): the series is stationary, i.e., has no unit root.
If we fail to reject the null hypothesis, the test provides evidence that the series is non-stationary.
We reject the null hypothesis (H0) when the p-value falls below the chosen significance level, here 0.05.
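In code, this decision rule amounts to comparing the p-value returned by adf.test() (from the tseries package, which returns an htest object) against the significance level; a minimal sketch:

```r
# adf.test() returns an htest object; its p.value field drives the decision.
library(tseries)
result <- adf.test(coal_df_ts |> log())
if (result$p.value < 0.05) {
  message("Reject H0: evidence the series is stationary")
} else {
  message("Fail to reject H0: the series appears non-stationary")
}
```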
Code
adf.test(coal_df_ts |> log())
Augmented Dickey-Fuller Test
data: log(coal_df_ts)
Dickey-Fuller = -2.9525, Lag order = 4, p-value = 0.1858
alternative hypothesis: stationary
The p-value of the ADF test is greater than 0.05, so we cannot reject the null hypothesis, which means the U.S. total coal consumption data is not stationary. The test result matches the conclusion we drew from the decomposition plot and the ACF plot.
Let’s take the first differencing and do the ADF test again.
Code
adf.test(coal_df_ts |> log() |> diff())
Warning in adf.test(diff(log(coal_df_ts))): p-value smaller than printed p-value
Augmented Dickey-Fuller Test
data: diff(log(coal_df_ts))
Dickey-Fuller = -4.3984, Lag order = 4, p-value = 0.01
alternative hypothesis: stationary
We can see that after first differencing, the p-value is below 0.05, which means the differenced series is stationary and no second difference is needed.
In this test, the null hypothesis is that the data are stationary, and we look for evidence that the null hypothesis is false. Consequently, small p-values (e.g., less than 0.05) suggest that differencing is required.
Code
library(urca)
coal_df_ts |> ur.kpss() |> summary()
#######################
# KPSS Unit Root Test #
#######################
Test is of type: mu with 3 lags.
Value of test-statistic is: 1.7562
Critical value for a significance level of:
10pct 5pct 2.5pct 1pct
critical values 0.347 0.463 0.574 0.739
The test statistic is much bigger than the 5% critical value, indicating that the null hypothesis is rejected. That is, the data are not stationary. We can difference the data, and apply the test again.
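The code chunk for this second test did not render above; following the same log transform and differencing used elsewhere in this analysis (one seasonal difference at lag 4, then a first difference), it would be:

```r
# KPSS test on the seasonally and first-differenced log coal series
coal_df_ts |>
  log() |>
  diff(lag = 4) |>
  diff() |>
  ur.kpss() |>
  summary()
```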
#######################
# KPSS Unit Root Test #
#######################
Test is of type: mu with 3 lags.
Value of test-statistic is: 0.0871
Critical value for a significance level of:
10pct 5pct 2.5pct 1pct
critical values 0.347 0.463 0.574 0.739
This time, the test statistic is tiny, and well within the range we would expect for stationary data, so we can conclude that the differenced data are stationary.
A similar function for determining whether seasonal differencing is required is nsdiffs(), which uses a measure of seasonal strength to determine the appropriate number of seasonal differences; ndiffs() estimates the number of first differences required. If the seasonal strength is less than 0.64, no seasonal differences are suggested; otherwise one seasonal difference is suggested.
We can apply nsdiffs() to the logged U.S. Total Coal Consumption data.
Code
coal_df_ts |> log() |> ndiffs()
[1] 1
Code
coal_df_ts |> log() |> diff() |> nsdiffs()
[1] 1
Code
coal_df_ts |> log() |> diff(lag = 4) |> nsdiffs()
[1] 0
ndiffs() returns 1 on the logged data (one first difference required), and nsdiffs() returns 1 on the first-differenced data (one seasonal difference required); after a seasonal difference is taken, nsdiffs() returns 0. Together these functions suggest we should take both a seasonal difference and a first difference.
In conclusion, after trying several differences and seasonal differences combinations, taking one seasonal difference and one difference is what the data needs to be stationary.
Differencing Results
Transformations such as logarithms can help to stabilise the variance of a time series. Differencing can help stabilise the mean of a time series by removing changes in the level of a time series, and therefore eliminating (or reducing) trend and seasonality.
Code
cbind("Original" = coal_df_ts,
      "Log Trans" = log(coal_df_ts),
      "Sea. Diff" = diff(log(coal_df_ts), lag = 4),
      "1st & Sea. Diff" = diff(diff(log(coal_df_ts), lag = 4))) |>
  autoplotly(facets = TRUE) +
  xlab("Year") + ylab("") +
  ggtitle("Differencing of U.S. Total Coal Consumption")
New Stationary Data
Finally, the seasonality and correlation should be removed to make the time series stationary. A comparison of all of the methods is seen below. Once the transformations are applied, the series is stationary. This can also be seen in the plot below.
Code
`Log Transformation` <- log(coal_df_ts)
`Remove Seasonality` <- diff(log(coal_df_ts), lag = 4)
`First Diff After Remove Seasonality` <- diff(diff(log(coal_df_ts), lag = 4))
a <- ggAcf(coal_df_ts, 70) + ggtitle("Original Data")
b <- ggAcf(`Log Transformation`, 70) + ggtitle("Log Transformation")
c <- ggAcf(`Remove Seasonality`, 70) + ggtitle("Remove Seasonality")
d <- ggAcf(`First Diff After Remove Seasonality`, 70) + ggtitle("First Diff After Remove Seasonality")
The first three graphs above are not as ideal as the last ACF graph which is First Diff After Remove Seasonality.
For white noise series and stationary series, we expect each autocorrelation to be close to zero. Of course, they will not be exactly equal to zero as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within ±2/√T where T is the length of the time series. It is common to plot these bounds on a graph of the ACF (the blue dashed lines above). If one or more large spikes are outside these bounds, or if substantially more than 5% of spikes are outside these bounds, then the series is probably not white noise, and it’s not stationary either.
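The bound itself is easy to compute; as a quick sketch, T is simply the length of the quarterly series:

```r
# 95% white-noise bounds for the ACF: +/- 2 / sqrt(T)
T <- length(coal_df_ts)  # number of quarterly observations
bound <- 2 / sqrt(T)
bound  # the blue dashed lines in the ACF plots sit near +/- this value
```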
This ACF graph of the detrended and differenced U.S. total coal consumption data shows that almost all the spikes are inside the blue bounds, which indicates the detrended and differenced data is stationary and ready to be used in future analysis.
Exploring U.S. CO2 Emissions Data
Time Series Plot
Code
emissions_df <- read.csv("/Users/raezh1/Documents/Georgetown/ANLY560/website/Time Series/data/df_total_monthly_CO2_emissions.csv", skip = 1, row.names = 1)
#emissions_ts <- ts(emission_df$sum, star=decimal_date(as.Date("1973-01-01",format = "%Y-%m-%d")),frequency = 12)
emissions_ts <- ts(emissions_df$sum, start = c(1973, 1), end = c(2022, 12), frequency = 12)
autoplotly(emissions_ts) + xlab("Year") + ylab("Million Metric Tons of Carbon Dioxide") + ggtitle("U.S. Monthly CO2 Emissions")
A multiplicative trend means the trend is not linear (a curved line), and multiplicative seasonality means the widths or heights of the seasonal periods change over time. Based on the time series graph created above, we should use the multiplicative model in decompose() to decompose the CO2 emissions series.
Code
decompose_emissions <- decompose(emissions_ts, "multiplicative")
autoplotly(decompose_emissions) + ggtitle("Decomposition of U.S. Total CO2 Emissions")
In R, the stl() function decomposes a time series into seasonal, trend, and irregular components using loess. stl() will handle any type of seasonality, not only monthly and quarterly data.
Code
emissions_ts |> stl(s.window = "periodic", robust = TRUE) |> autoplotly() + ggtitle("Decomposition of U.S. Total CO2 Emissions")
The two decomposition plots are very similar, except that the remainder from the stl() function is less smoothed and noisier than the remainder from decompose(). Both show that the U.S. monthly CO2 emissions data has seasonality. Before 2008 the trend of the data is increasing; after 2008 it is decreasing.
Since the data has a trend, it is not stationary: a stationary time series is one whose properties do not depend on the time at which the series is observed. The ACF graph below will show more evidence of this non-stationarity.
Lag Plots
A lag plot is a type of scatter plot in which a time series is plotted against a lagged copy of itself, some number of time units behind or ahead. Lag plots often reveal more about the seasonality of the data, whether there is randomness in the data, or whether autocorrelation is present. Below is a lag plot for the monthly U.S. CO2 emissions data.
Code
library(wesanderson)
gglagplot(emissions_ts, do.lines = FALSE,
          set.lags = c(1, 12, 12*2, 12*3, 12*4, 12*5, 12*6, 12*7, 12*8, 12*9, 12*10, 12*11, 12*12, 12*13, 12*14, 12*15)) +
  xlab("Lags") +
  ggtitle("Lag Plot of U.S. CO2 Emissions") +
  theme_minimal() +
  ylab("Million Metric Tons of Carbon Dioxide") +
  theme_light() +
  theme(text = element_text(size = 12, family = "Palatino")) +
  labs(fill = "Legend") +
  #scale_color_brewer(palette="Set2") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
From the lag plots we can see that the CO2 emissions data has serial correlation: the lag plots show a linear pattern, which suggests autocorrelation is present. The pattern is a positive linear trend, so the autocorrelation is positive. Seasonality, however, is hard to spot from the lag plots alone; the ACF plot below is better suited to revealing it.
Autocorrelation and Stationarity
The function ACF computes (and by default plots) an estimate of the autocorrelation function of a univariate time series. Function PACF computes (and by default plots) an estimate of the partial autocorrelation function of a univariate time series.
Autocorrelation and partial autocorrelation plots are heavily used in time series analysis and forecasting.
Code
acf <- ggAcf(emissions_ts, 80) + ggtitle("ACF Plot of U.S. Total CO2 Emissions")
ggplotly(acf)
Note that the ACF shows an oscillation, indicative of a seasonal series. The peaks occur at lags 12, 24, 36, and so on, i.e., every twelve months, which means the data has yearly seasonality.
A stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary — it does not matter when you observe it, it should look much the same at any point in time.
As well as looking at the time plot of the data, the ACF plot is also useful for identifying non-stationary time series. For a stationary time series, the ACF will drop to zero relatively quickly, while the ACF of non-stationary data decreases slowly. Our ACF graph decreases slowly and we can clearly see the data is not stationary.
Code
pacf <- ggPacf(emissions_ts, 80) + ggtitle("PACF Plot of U.S. Total CO2 Emissions")
ggplotly(pacf)
From the PACF plot, we can see a large spike at lag 1 followed by spikes that decay over roughly the first 13 lags, which indicates a moving average term in the data.
Differencing and Detrending
Differences and Seasonal Differences Estimation with Stationarity and Seasonality Tests
The Augmented Dickey-Fuller test is a type of statistical test called a unit root test.
The ADF test is conducted with the following hypotheses.
Null Hypothesis (H0): the series is non-stationary, i.e., has a unit root.
Alternate Hypothesis (HA): the series is stationary, i.e., has no unit root.
If we fail to reject the null hypothesis, the test provides evidence that the series is non-stationary.
We reject the null hypothesis (H0) when the p-value falls below the chosen significance level, here 0.05.
Code
adf.test(emissions_ts |> log())
Augmented Dickey-Fuller Test
data: log(emissions_ts)
Dickey-Fuller = -2.4005, Lag order = 8, p-value = 0.4088
alternative hypothesis: stationary
The p-value of the ADF test is greater than 0.05, so we cannot reject the null hypothesis, which means the U.S. total CO2 emissions data is not stationary. The test result matches the conclusion we drew from the decomposition plot and the ACF plot.
Let’s do the ADF test after first differencing.
Code
adf.test(emissions_ts |> log() |> diff())
Warning in adf.test(diff(log(emissions_ts))): p-value smaller than printed
p-value
Augmented Dickey-Fuller Test
data: diff(log(emissions_ts))
Dickey-Fuller = -14.873, Lag order = 8, p-value = 0.01
alternative hypothesis: stationary
We can see that after first differencing, the p-value is below 0.05, which means the differenced series is stationary and no second difference is needed.
In this test, the null hypothesis is that the data are stationary, and we look for evidence that the null hypothesis is false. Consequently, small p-values (e.g., less than 0.05) suggest that differencing is required.
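The code chunk for this test did not render above; applying ur.kpss() to the emissions series, analogous to the coal analysis, would look like:

```r
# KPSS test on the monthly CO2 emissions series
library(urca)
emissions_ts |> ur.kpss() |> summary()
```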
#######################
# KPSS Unit Root Test #
#######################
Test is of type: mu with 6 lags.
Value of test-statistic is: 3.9813
Critical value for a significance level of:
10pct 5pct 2.5pct 1pct
critical values 0.347 0.463 0.574 0.739
The test statistic is much bigger than the 5% critical value, indicating that the null hypothesis is rejected. That is, the data are not stationary. We can difference the data, and apply the test again.
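The code for the differenced test also did not render; mirroring the coal workflow (log transform, seasonal difference at lag 12, then a first difference), it would be:

```r
# KPSS test on the seasonally and first-differenced log emissions series
emissions_ts |>
  log() |>
  diff(lag = 12) |>
  diff() |>
  ur.kpss() |>
  summary()
```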
#######################
# KPSS Unit Root Test #
#######################
Test is of type: mu with 6 lags.
Value of test-statistic is: 0.0218
Critical value for a significance level of:
10pct 5pct 2.5pct 1pct
critical values 0.347 0.463 0.574 0.739
This time, the test statistic is small and well within the range we would expect for stationary data, so we can conclude that the differenced data are stationary.
A similar function for determining whether seasonal differencing is required is nsdiffs(), which uses a measure of seasonal strength to determine the appropriate number of seasonal differences. If the seasonal strength is less than 0.64, no seasonal differences are suggested; otherwise one seasonal difference is suggested.
We can apply nsdiffs() to the logged U.S. Total CO2 Emissions data.
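The corresponding code chunks did not render here; mirroring the coal analysis, they would be:

```r
# First differences needed on the logged emissions series
emissions_ts |> log() |> ndiffs()

# Seasonal differences needed after one first difference
emissions_ts |> log() |> diff() |> nsdiffs()

# Seasonal differences needed after a yearly (lag 12) seasonal difference
emissions_ts |> log() |> diff(lag = 12) |> nsdiffs()
```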
Because ndiffs() and nsdiffs() both return 1 (indicating one first difference and one seasonal difference are required), we apply nsdiffs() again to the seasonally differenced data.
lag=12 takes a yearly seasonal difference, and the resulting seasonal strength is below the 0.64 threshold, which means the time series no longer has seasonality after taking one yearly seasonal difference.
In conclusion, after trying several combinations of differences and seasonal differences, one seasonal difference plus one first difference is what this data needs to be stationary.
Differencing Results
Code
cbind("Original" = emissions_ts,
      "Log Trans" = log(emissions_ts),
      "Sea. Diff" = diff(log(emissions_ts), lag = 12),
      "1st & Sea. Diff" = diff(diff(log(emissions_ts), lag = 12))) %>%
  autoplotly(facets = TRUE) +
  xlab("Year") + ylab("") +
  ggtitle("Differencing of U.S. Total CO2 Emissions")
New Stationary Data
Finally, the seasonality and correlation should be removed to make the time series stationary. A comparison of all of the methods is seen below. Once the transformations are applied, the series is stationary. This can also be seen in the plot below.
Code
`Log Transformation` <- log(emissions_ts)
`Remove Seasonality` <- diff(log(emissions_ts), lag = 12)
`First Diff After Remove Seasonality` <- diff(diff(log(emissions_ts), lag = 12))
a <- ggAcf(emissions_ts, 70) + ggtitle("Original Data")
b <- ggAcf(`Log Transformation`, 70) + ggtitle("Log Transformation")
c <- ggAcf(`Remove Seasonality`, 70) + ggtitle("Remove Seasonality")
d <- ggAcf(`First Diff After Remove Seasonality`, 70) + ggtitle("First Diff After Remove Seasonality")
The first three graphs above are not as ideal as the last ACF graph which is First Diff After Remove Seasonality.
For white noise series and stationary series, we expect each autocorrelation to be close to zero. Of course, they will not be exactly equal to zero as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within ±2/√T where T is the length of the time series. It is common to plot these bounds on a graph of the ACF (the blue dashed lines above). If one or more large spikes are outside these bounds, or if substantially more than 5% of spikes are outside these bounds, then the series is probably not white noise, and it’s not stationary either.
This ACF graph of the detrended and differenced U.S. total CO2 emissions data shows that almost all the spikes are inside the blue bounds, which indicates the detrended and differenced data is stationary and ready to be used in future analysis.
Source Code
---title: "Exploratory Data Analysis"format: html: embed-resources: true self-contained: true code-fold: true code-copy: true code-tools: true code-overflow: wrap---# Exploring Coal Consumption DataExploring the elements of the average temperature time series is required to move on. So that further models may take these components into account, it is crucial to understand how each time series component affects the entire time series. Time series data primarily consist of four elements: level, trend, seasonality, and noise. Noise and unpredictability are nonsystematic components, whereas level, trend, and seasonality are systematic. A greater comprehension of a time series may be attained by breaking it down into these parts.In this exploratory data analysis, the plots will be looking at a time series for the quarterly U.S. total coal consumption only. Even though the data collected has information on all of the 50 states (and that information will be used in future analysis), it makes sense to scope the analysis to only U.S. for simplification purposes. 
It also makes sense since all the states follow the same trend of coal consumption.| <br>## Time Series PlotFirstly, it is important to visualize the time series at its most basic level, which is a simple plot of data over time.```{r,echo=FALSE,message=FALSE,warning=FALSE}library(tidyverse)library(ggplot2)library(forecast)library(astsa) library(xts)library(tseries)library(fpp2)library(fma)library(lubridate)library(tidyverse)library(TSstudio)library(quantmod)library(tidyquant)library(plotly)library(ggplot2)library(ggfortify)library(autoplotly)library(ggpubr)``````{r}coal_df <-read.csv("/Users/raezh1/Documents/Georgetown/ANLY560/website/Time Series/data/coal_us_consumption.csv") |>group_by(period) |>summarise(Consumption =sum(consumption)) |>filter(!period %in%c('2000-Q1', '2000-Q2', '2000-Q3', '2000-Q4'))coal_df_ts <-ts(coal_df$Consumption, start =c(2001,1), end =c(2021,1), frequency =4)``````{r}theme_set(theme_gray(base_size=12,base_family="Palatino"))autoplotly(coal_df_ts) +ggtitle("U.S. Quarterly Coal Consumption") +xlab("Year (2001-2021)") +ylab("Short Ton")```| <br>## Decomposition::: panel-tabset## **`decompose()` Function***Multiplicative trend* means the trend is not linear (curved line), and *multiplicative seasonality* means there are changes to widths or heights of seasonal periods over time. From the time series graph created above, we should use multiplicative model from `decompose()` to decompose the coal consumption.```{r}decompose_coal =decompose(coal_df_ts, "multiplicative")autoplotly(decompose_coal) +ggtitle("Decomposition of U.S. Total Coal Consumption")```## **`stl()` Function**In R the `stl()` function performs decomposition of a time series into seasonal, trend and irregular components using loses. 
`stl()` will handle any type of seasonality, not only monthly and quarterly data.```{r}coal_df_ts |>stl(s.window="periodic", robust=TRUE) |>autoplotly()```:::| The two plots of decomposition are very similar except the remainder from stl() function is less smoothed and more noisy than the remainder from decompose() function. They both show U.S. the U.S. quarterly coal consumption data has seasonality. Before 2010, the trend of the data is increasing. After 2010, the trend of the data is decreasing.| Since the data has a trend, it doesn't has stationarity as a stationary time series is one whose properties do not depend on the time at which the series is observed. The ACF graph below will show more of its non-stationary feature.| <br>## Lag PlotsA lag plot is a type of scatter plot where time series are plotted in pairs against itself some time units behind or ahead. Lag plots often reveal more information on the seasonality of the data, whether there is randomness in the data or an indication of autocorrelation in the data. Below is a lag plot for the quarterly U.S. coal consumption data.```{r,warning=FALSE}library(wesanderson)gglagplot(coal_df_ts, do.lines=FALSE, set.lags =c(4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64)) +xlab("Lags") +ggtitle("Lag Plot of U.S. Total Coal Consumption") +theme_minimal() +ylab("Coal Consumption in short tons") +theme_light() +theme(text=element_text(size=12, family="Palatino")) +labs(fill ="Legend") +scale_color_brewer(palette="Set2") +theme(axis.text.x=element_text(angle=90, hjust=1))```From lag plots here we can see that the coal consumption data has serial correlation as the lag plots are showing a linear pattern, which suggests autocorrelation is present. This is a positive linear trend, so the data has positive autocorrelation. 
However, it's hard to spot the seasonality from the lag plots, which makes me believe the data doesn't have seasonality.| <br>## Autocorrelation and StationarityThe function ACF computes (and by default plots) an estimate of the autocorrelation function of a univariate time series. Function PACF computes (and by default plots) an estimate of the partial autocorrelation function of a univariate time series.Autocorrelation and partial autocorrelation plots are heavily used in time series analysis and forecasting.::: panel-tabset# ACF Plot```{r}acf <-ggAcf(coal_df_ts, 80) +ggtitle("ACF Plot of U.S. Total Coal Consumption")ggplotly(acf)```Note that the ACF shows an oscillation, indicative of a seasonal series. Note the peaks occur at lags of 4th quarter, 8th quarter, and 12th quarter, etc. It means the data has **yearly seasonality**.A stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary --- the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary --- it does not matter when you observe it, it should look much the same at any point in time.As well as looking at the time plot of the data, the ACF plot is also useful for identifying non-stationary time series. For a stationary time series, the ACF will drop to zero relatively quickly, while the ACF of non-stationary data decreases slowly. Our ACF graph decreases slowly and we can clearly see the data is not stationary.# PACF Plot```{r}pacf <-ggPacf(coal_df_ts, 50) +ggtitle("PACF Plot of U.S. 
Total Coal Consumption")ggplotly(pacf)```From the PACF plot, we can see a large spike at lag 1 that decreases after a few lags, which indicates a moving average term in the data.:::| <br>## Differencing and Detrending### Differences and Seasonal Differences Estimation with Stationarity and Seasonality Tests::: panel-tabset# **Stationarity Test - Augmented Dickey-Fuller Test**The *Augmented Dickey-Fuller* test is a type of statistical test called a *unit root test*.ADF test is conducted with the following assumptions.*Null Hypothesis (H~O~)*: Series is non-stationary or series has a unit root.*Alternate Hypothesis(H~A~)*: Series is stationary or series has no unit root.If the null hypothesis is failed to be rejected, this test may provide evidence that the series is non-stationary.Conditions to Reject Null Hypothesis(H~O~)```{r}adf.test(coal_df_ts |>log())```The p value of ADF test is greater than 0.05, so we cannot reject null hypothesis and it means the U.S. total coal consumption data is **not stationary**. The rest result matches with the conclusion we had after observing decomposition plot and ACF plot.Let's take the first differencing and do the ADF test again.```{r}adf.test(coal_df_ts |>log() |>diff())```We can see that after first differencing, the p value is below 0.05 which means it's stationary now and we don't need the second differencing.# **Stationarity Test -** KPSS TestIn this test, the null hypothesis is that the data are stationary, and we look for evidence that the null hypothesis is false. Consequently, small p-values (e.g., less than 0.05) suggest that differencing is required.```{r}library(urca)coal_df_ts |>ur.kpss() |>summary()```The test statistic is much bigger than the 5% critical value, indicating that the null hypothesis is rejected. That is, the data are not stationary. 
We can difference the data and apply the test again.

```{r}
coal_df_ts |> log() |> diff(lag=4) |> diff() |> ur.kpss() |> summary()
```

This time, the test statistic is tiny and well within the range we would expect for stationary data, so we can conclude that the differenced data are stationary.

# Seasonality Test - `nsdiffs()` Test

A similar function for determining whether seasonal differencing is required is `nsdiffs()`, which uses a measure of seasonal strength, $F_S = \max\left(0,\ 1 - \frac{\mathrm{Var}(R_t)}{\mathrm{Var}(S_t + R_t)}\right)$, to determine the appropriate number of seasonal differences required, while `ndiffs()` estimates the number of first differences required. No seasonal difference is suggested if $F_S$ is less than 0.64; otherwise one seasonal difference is suggested.

We can apply these functions to the logged U.S. Total Coal Consumption data.

```{r}
coal_df_ts |> log() |> ndiffs()
coal_df_ts |> log() |> diff() |> nsdiffs()
coal_df_ts |> log() |> diff(lag=4) |> nsdiffs()
```

Because `ndiffs()` returns 1 on the logged data (indicating one first difference is required) and `nsdiffs()` returns 1 on the first-differenced data (indicating one seasonal difference is required), we apply `nsdiffs()` again to the seasonally differenced data. Together, these two functions suggest we should take both a seasonal difference and a first difference.

:::

In conclusion, after trying several combinations of differences and seasonal differences, taking one seasonal difference and one first difference is what the data needs to be stationary.

### Differencing Results

Transformations such as logarithms can help to stabilise the variance of a time series. Differencing can help stabilise the mean of a time series by removing changes in its level, thereby eliminating (or reducing) trend and seasonality.

```{r}
cbind("Original" = coal_df_ts,
      "Log Trans" = log(coal_df_ts),
      "Sea. Diff" = diff(log(coal_df_ts), lag=4),
      "1st & Sea. Diff" = diff(diff(log(coal_df_ts), lag=4))) |>
  autoplotly(facets=TRUE) +
  xlab("Year") + ylab("") +
  ggtitle("Differencing of U.S. Total Coal Consumption")
```

## New Stationary Data

Finally, the seasonality and correlation should be removed to make the time series stationary. A comparison of all of the methods is seen below. Once the transformations are applied, the series is stationary, as the plots below also show.

```{r}
`Log Transformation` <- log(coal_df_ts)
`Remove Seasonality` <- diff(log(coal_df_ts), lag=4)
`First Diff After Remove Seasonality` <- diff(diff(log(coal_df_ts), lag=4))

a <- ggAcf(coal_df_ts, 70) + ggtitle("Original Data")
b <- ggAcf(`Log Transformation`, 70) + ggtitle("Log Transformation")
c <- ggAcf(`Remove Seasonality`, 70) + ggtitle("Remove Seasonality")
d <- ggAcf(`First Diff After Remove Seasonality`, 70) + ggtitle("First Diff After Remove Seasonality")
```

### ACF Plots

::: panel-tabset

## Original and Log Transformation

```{r}
ggarrange(a, b, ncol = 1, nrow = 2)
```

## Detrend and Remove Seasonality

```{r}
ggarrange(c, d, ncol = 1, nrow = 2)
```

:::

The first three graphs above are not as ideal as the last ACF graph, `First Diff After Remove Seasonality`.

For white noise series and stationary series, we expect each autocorrelation to be close to zero. Of course, they will not be exactly zero, as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within ±2/√T, where T is the length of the time series. It is common to plot these bounds on a graph of the ACF (the blue dashed lines above). If one or more large spikes lie outside these bounds, or if substantially more than 5% of spikes lie outside them, then the series is probably not white noise, and it is not stationary either.

The ACF graph of the detrended and differenced U.S. total coal consumption data shows almost all spikes inside the blue bounds, which indicates the detrended and differenced data is stationary and ready to be used in future analysis.

# Exploring U.S. CO2 Emissions Data

## Time Series Plot

```{r}
emissions_df <- read.csv("/Users/raezh1/Documents/Georgetown/ANLY560/website/Time Series/data/df_total_monthly_CO2_emissions.csv", skip=1, row.names = 1)
#emissions_ts <- ts(emission_df$sum, star=decimal_date(as.Date("1973-01-01",format = "%Y-%m-%d")),frequency = 12)
emissions_ts <- ts(emissions_df$sum, start = c(1973,1), end = c(2022,12), frequency = 12)

autoplotly(emissions_ts) +
  xlab("Year") +
  ylab("Million Metric Tons of Carbon Dioxide") +
  ggtitle("U.S. Monthly CO2 Emissions")
```

| <br>

## Decomposition

::: panel-tabset

## **`decompose()` Function**

*Multiplicative trend* means the trend is not linear (a curved line), and *multiplicative seasonality* means the widths or heights of the seasonal periods change over time. Based on the time series graph created above, we should use the multiplicative model from `decompose()` to decompose the CO2 emissions.

```{r}
decompose_emissions = decompose(emissions_ts, "multiplicative")
autoplotly(decompose_emissions) + ggtitle("Decomposition of U.S. Total CO2 Emissions")
```

## **`stl()` Function**

In R, the `stl()` function decomposes a time series into seasonal, trend, and irregular components using loess. `stl()` will handle any type of seasonality, not only monthly and quarterly data.

```{r}
emissions_ts |> stl(s.window="periodic", robust=TRUE) |> autoplotly() + ggtitle("Decomposition of U.S. Total CO2 Emissions")
```

:::

| The two decomposition plots are very similar, except that the remainder from the `stl()` function is less smoothed and noisier than the remainder from the `decompose()` function. Both show that the U.S. monthly CO2 emissions data has seasonality. Before 2008, the trend of the data is increasing; after 2008, it is decreasing.
| Since the data has a trend, it is not stationary, as a stationary time series is one whose properties do not depend on the time at which the series is observed.
The ACF graph below will show its non-stationary features more clearly.

| <br>

## Lag Plots

A lag plot is a type of scatter plot in which a time series is plotted against itself some number of time units behind or ahead. Lag plots often reveal more information on the seasonality of the data, whether there is randomness in the data, or an indication of autocorrelation in the data. Below is a lag plot for the monthly U.S. CO2 emissions data.

```{r}
library(wesanderson)

gglagplot(emissions_ts, do.lines=FALSE,
          set.lags = c(1, 12, 12*2, 12*3, 12*4, 12*5, 12*6, 12*7, 12*8,
                       12*9, 12*10, 12*11, 12*12, 12*13, 12*14, 12*15)) +
  xlab("Lags") +
  ggtitle("Lag Plot of U.S. CO2 Emissions") +
  theme_minimal() +
  ylab("Million Metric Tons of Carbon Dioxide") +
  theme_light() +
  theme(text=element_text(size=12, family="Palatino")) +
  labs(fill = "Legend") +
  #scale_color_brewer(palette="Set2") +
  theme(axis.text.x=element_text(angle=90, hjust=1))
```

From the lag plots we can see that the CO2 emissions data has serial correlation: the lag plots show a linear pattern, which suggests autocorrelation is present. The trend is a positive linear one, so the data has positive autocorrelation. However, the seasonality is hard to spot from the lag plots alone.

| <br>

## Autocorrelation and Stationarity

The ACF computes (and by default plots) an estimate of the autocorrelation function of a univariate time series. The PACF computes (and by default plots) an estimate of the partial autocorrelation function of a univariate time series. Autocorrelation and partial autocorrelation plots are heavily used in time series analysis and forecasting.

::: panel-tabset

## ACF Plot

```{r}
acf <- ggAcf(emissions_ts, 80) + ggtitle("ACF Plot of U.S. Monthly CO2 Emissions")
ggplotly(acf)
```

Note that the ACF shows an oscillation, indicative of a seasonal series, and that the peaks occur at lags 12, 24, 36, and so on.
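As a quick numerical check of these seasonal peaks, we can pull the autocorrelations at lags 12, 24, and 36 directly. This is a sketch that assumes the `emissions_ts` object created above is in scope:

```r
# Extract the autocorrelations at the seasonal lags 12, 24, and 36
# (assumes `emissions_ts` from the chunk above is in scope)
acf_vals <- acf(emissions_ts, lag.max = 36, plot = FALSE)
acf_vals$acf[c(12, 24, 36) + 1]  # +1 because position 1 of the array is lag 0
```

If the series is seasonal with period 12, these values should stand out as local maxima relative to the neighbouring lags.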
These peaks at the seasonal lags mean the data has **yearly seasonality**.

A stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary --- the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary --- it does not matter when you observe it, it should look much the same at any point in time.

As well as looking at the time plot of the data, the ACF plot is also useful for identifying non-stationary time series. For a stationary time series, the ACF will drop to zero relatively quickly, while the ACF of non-stationary data decreases slowly. Our ACF graph decreases slowly, so we can clearly see the data is not stationary.

## PACF Plot

```{r}
pacf <- ggPacf(emissions_ts, 80) + ggtitle("PACF Plot of U.S. Total CO2 Emissions")
ggplotly(pacf)
```

From the PACF plot, we can see a large spike at lag 1 that decreases after 13 lags, which indicates a moving average term in the data.

:::

| <br>

## Differencing and Detrending

### Differences and Seasonal Differences Estimation with Stationarity and Seasonality Tests

::: panel-tabset

# Stationarity Test - Augmented Dickey-Fuller Test

The *Augmented Dickey-Fuller* test is a type of statistical test called a *unit root test*. The ADF test is conducted under the following hypotheses:

*Null Hypothesis (H~0~)*: the series is non-stationary, i.e., the series has a unit root.

*Alternate Hypothesis (H~A~)*: the series is stationary, i.e., the series has no unit root.

If we fail to reject the null hypothesis, the test provides evidence that the series is non-stationary. We reject the null hypothesis (H~0~) when the p-value is below 0.05.

```{r}
adf.test(emissions_ts |> log())
```

The p-value of the ADF test is greater than 0.05, so we cannot reject the null hypothesis, which means the U.S. total CO2 emissions data is **not stationary**.
This test result matches the conclusion we reached after observing the decomposition plot and the ACF plot.

Let's run the ADF test again after first differencing.

```{r}
adf.test(emissions_ts |> log() |> diff())
```

We can see that after first differencing, the p-value is below 0.05, which means the series is stationary now and we don't need a second differencing.

# Stationarity Test - KPSS Test

In this test, the null hypothesis is that the data are stationary, and we look for evidence that the null hypothesis is false. Consequently, small p-values (e.g., less than 0.05) suggest that differencing is required.

```{r}
library(urca)
emissions_ts |> log() |> ur.kpss() |> summary()
```

The test statistic is much bigger than the 5% critical value, indicating that the null hypothesis is rejected. That is, the data are not stationary. We can difference the data and apply the test again.

```{r}
emissions_ts |> log() |> diff() |> ur.kpss() |> summary()
```

This time, the test statistic is small and well within the range we would expect for stationary data, so we can conclude that the differenced data are stationary.

# Seasonality Test - `nsdiffs()` Test

A similar function for determining whether seasonal differencing is required is `nsdiffs()`, which uses a measure of seasonal strength to determine the appropriate number of seasonal differences required. No seasonal difference is suggested if the result is less than 0.64; otherwise one seasonal difference is suggested.

We can apply `nsdiffs()` to the logged U.S. Total CO2 Emissions data.

```{r}
emissions_ts |> log() |> ndiffs()
emissions_ts |> log() |> diff() |> nsdiffs()
emissions_ts |> log() |> diff() |> diff(lag=12) |> nsdiffs()
```

Because both `ndiffs()` and `nsdiffs()` return 1 (indicating that one first difference and one seasonal difference are required), we apply `nsdiffs()` again to the seasonally differenced data. Here `lag=12` means taking a yearly seasonal difference, and the result is less than 0.64, which means our time series no longer has seasonality after taking a yearly seasonal difference.

:::

In conclusion, after trying several combinations of differences and seasonal differences, taking one seasonal difference and one first difference is what this data needs to be stationary.

### **Differencing Results**

```{r}
cbind("Original" = emissions_ts,
      "Log Trans" = log(emissions_ts),
      "Sea. Diff" = diff(log(emissions_ts), lag=12),
      "1st & Sea. Diff" = diff(diff(log(emissions_ts), lag=12))) %>%
  autoplotly(facets=TRUE) +
  xlab("Year") + ylab("") +
  ggtitle("Differencing of U.S. Total CO2 Emissions")
```

## New Stationary Data

Finally, the seasonality and correlation should be removed to make the time series stationary. A comparison of all of the methods is seen below. Once the transformations are applied, the series is stationary.
This can also be seen in the plots below.

```{r}
`Log Transformation` <- log(emissions_ts)
`Remove Seasonality` <- diff(log(emissions_ts), lag=12)
`First Diff After Remove Seasonality` <- diff(diff(log(emissions_ts), lag=12))

a <- ggAcf(emissions_ts, 70) + ggtitle("Original Data")
b <- ggAcf(`Log Transformation`, 70) + ggtitle("Log Transformation")
c <- ggAcf(`Remove Seasonality`, 70) + ggtitle("Remove Seasonality")
d <- ggAcf(`First Diff After Remove Seasonality`, 70) + ggtitle("First Diff After Remove Seasonality")
```

### ACF Plots

::: panel-tabset

## Original and Log Transformation

```{r}
ggarrange(a, b, ncol = 1, nrow = 2)
```

## Detrend and Remove Seasonality

```{r}
ggarrange(c, d, ncol = 1, nrow = 2)
```

:::

The first three graphs above are not as ideal as the last ACF graph, `First Diff After Remove Seasonality`.

For white noise series and stationary series, we expect each autocorrelation to be close to zero. Of course, they will not be exactly zero, as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within ±2/√T, where T is the length of the time series. It is common to plot these bounds on a graph of the ACF (the blue dashed lines above). If one or more large spikes lie outside these bounds, or if substantially more than 5% of spikes lie outside them, then the series is probably not white noise, and it is not stationary either.

The ACF graph of the detrended and differenced U.S. total CO2 emissions data shows almost all the spikes inside the blue bounds, which indicates the detrended and differenced data is stationary and ready to be used in future analysis.
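As a numerical complement to the visual ACF check, we can compute the ±2/√T bound directly and run a Ljung-Box portmanteau test on the fully differenced series. This is a sketch that assumes the `emissions_ts` object from the chunks above is in scope:

```r
# Final transformed series: log, seasonal difference (lag 12), then first difference
# (assumes `emissions_ts` from the chunks above is in scope)
stationary_series <- diff(diff(log(emissions_ts), lag = 12))

# The ±2/sqrt(T) white-noise bound drawn as blue dashed lines on the ACF plots
2 / sqrt(length(stationary_series))

# Ljung-Box test: a large p-value is consistent with white noise,
# i.e., no substantial autocorrelation remaining after differencing
Box.test(stationary_series, lag = 24, type = "Ljung-Box")
```

Note that the test may still flag some residual autocorrelation at particular lags; that is expected, since differencing removes trend and seasonality but any remaining ARMA structure is what later model fitting will capture.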